Gaussians

You just learned a little about what a Gaussian distribution looks like. As a reminder, a Gaussian curve is sometimes called a bell curve because the shape looks like a bell.

To review, the equation for the Gaussian curve is the following:

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}$

where $\mu$ is the mean and $\sigma$ is the standard deviation.
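As a quick sanity check on the formula, the density can be evaluated directly. This is a minimal sketch using only the standard library; the function name `gaussian_pdf` is made up for illustration and is not part of the notebook.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate f(x) for a Gaussian with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma**2)   # normalization factor
    return norm * math.exp(-((x - mu) ** 2) / (2.0 * sigma**2))

peak = gaussian_pdf(0.0)        # height at the mean of the standard normal
one_sigma = gaussian_pdf(1.0)   # height one standard deviation above the mean

print(round(peak, 4))           # 0.3989, i.e. 1/sqrt(2*pi)
print(gaussian_pdf(1.0) == gaussian_pdf(-1.0))  # True: the curve is symmetric about mu
```

The curve peaks at $x=\mu$ and falls off symmetrically on either side, which is where the bell shape comes from.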

To draw random samples from the standard normal distribution, where $\mu=0$ and $\sigma=1$, call np.random.randn().
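The sketch below draws standard normal samples and then rescales them to a different mean and standard deviation. The seed, the sample size, and the values 12 and 3 are arbitrary choices for illustration; they are not taken from the lifetimes data.

```python
import numpy as np

np.random.seed(0)                      # fixed seed so the run is repeatable
samples = np.random.randn(100000)      # draws from the standard normal (mu=0, sigma=1)

# Shifting and scaling gives a normal distribution with any mean and spread.
mu, sigma = 12.0, 3.0                  # arbitrary illustration values
shifted = mu + sigma * samples

print(samples.mean(), samples.std())   # close to 0 and 1
print(shifted.mean(), shifted.std())   # close to 12 and 3
```

The sample mean and standard deviation only approximate the true values; the agreement improves as the number of samples grows.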

You're probably wondering why Gaussian (a.k.a. normal) distributions are so important. The reason is that many real-world quantities are approximately normally distributed, such as the heights of people, the dimensions of manufactured parts, blood pressure readings, and measurement errors, so it pays to understand how the distribution behaves.

A normal distribution has three characteristic properties.

1) The mean, median, and mode of a Gaussian distribution are all the same.
2) There is symmetry about the mean: 50% of the values fall to the right of the mean and the other 50% fall to the left.
3) A fixed fraction of the data falls within integer multiples of the standard deviation: about 68% within one standard deviation of the mean, about 95% within two, and about 99.7% within three.
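For the third property, the expected fractions follow from integrating the Gaussian curve between $\mu - k\sigma$ and $\mu + k\sigma$, which works out to $\mathrm{erf}(k/\sqrt{2})$. A short sketch using the standard library (the helper name `fraction_within` is illustrative only):

```python
import math

def fraction_within(k):
    """Fraction of a Gaussian lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

These are the reference values to compare against when testing whether a data set looks Gaussian.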

Does the lifetimes data we plotted earlier meet these three criteria? Let's find out.

Remember, the lifetimes data was imported earlier and stored in the variable lifetimes.


In [2]:
lifemean = np.mean(lifetimes) #get mean
lifestd = np.std(lifetimes) #get standard deviation

Let's examine the first criterion: the mean, median, and mode of a Gaussian distribution are all the same.

To calculate the mode, we need to import the stats module from SciPy. The median can still be calculated with NumPy.


In [4]:
#import stats module
from scipy import stats

Now calculate the median and mode of the variable lifetimes and display them.


In [5]:
#your code here
lifemode = stats.mode(lifetimes) #calculate mode
lifemedian = np.median(lifetimes) #calculate median

print(lifemean)
print(lifemode)
print(lifemedian)


12.0260440367
ModeResult(mode=array([ 9.0798]), count=array([2]))
11.838

Does the lifetimes data fulfill the first criterion of a Gaussian distribution?
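One caveat when answering: for continuous measurements, the most frequent exact value may occur only a couple of times, so it says little about where the distribution peaks. A common alternative is to take the center of the tallest histogram bin. The sketch below uses synthetic data as a stand-in (it is not the lifetimes data), and the bin count of 30 is an arbitrary assumption.

```python
import numpy as np

np.random.seed(1)
sample = np.random.randn(5000) * 3.0 + 12.0   # synthetic stand-in, NOT the lifetimes data

# Count samples per bin, then take the center of the most populated bin.
counts, edges = np.histogram(sample, bins=30)
peak = np.argmax(counts)                       # index of the tallest bin
binned_mode = 0.5 * (edges[peak] + edges[peak + 1])

print(sample.mean(), np.median(sample), binned_mode)
```

For Gaussian-like data the three numbers should land close together, with the binned mode accurate only to about one bin width.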

Now let's check the second criterion. Is there symmetry about the mean?

First, let's find out how many samples are in the variable lifetimes and display it.


In [6]:
#your code here
numsamp = len(lifetimes)
print(numsamp)


327

Now that you have the number of samples, you will need to use the median value to find out how many samples lie above and below it.


In [16]:
#Put your code here

#boolean-mask indexing needs square brackets: lifetimes(uppermask) fails
#because parentheses try to call lifetimes as if it were a function
lifetimes_arr = np.asarray(lifetimes) #make sure we have a NumPy array
uppermask = lifetimes_arr>lifemedian
upperhalf = lifetimes_arr[uppermask] #values above the median
lowermask = lifetimes_arr<=lifemedian
lowerhalf = lifetimes_arr[lowermask] #values at or below the median

upperperc = len(upperhalf)/numsamp
lowerperc = len(lowerhalf)/numsamp

print(upperperc)
print(lowerperc)


0.4984709480122324
0.5015290519877675

Does the lifetimes data fulfill the second criterion of a Gaussian distribution?
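Keep in mind that splitting at the median gives a roughly 50/50 split for any data set, so a stricter test of symmetry is to split at the mean instead. The sketch below does this with boolean masks on synthetic data; the array `values` is a hypothetical stand-in, not the lifetimes data.

```python
import numpy as np

np.random.seed(2)
values = np.random.randn(1000) * 3.0 + 12.0   # synthetic stand-in, NOT the lifetimes data

mean = values.mean()
upper_mask = values > mean        # boolean array, True where a sample exceeds the mean
upper = values[upper_mask]        # square brackets select the True entries
lower = values[~upper_mask]       # ~ inverts the mask: samples at or below the mean

frac_above = upper.size / values.size
frac_below = lower.size / values.size
print(frac_above, frac_below)
```

If the data is stored as a plain Python list, convert it first with `np.asarray`, since comparing a list to a number does not produce a mask.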

Now let's check the last criterion. How much data falls within a standard deviation or two (or three)?

Remember, you already calculated the standard deviation of the lifetimes data as the variable lifestd.


In [27]:
#Put your code here

plus_std = (lifemedian+1*lifestd, lifemedian+2*lifestd, lifemedian+3*lifestd)
minus_std = (lifemedian-1*lifestd, lifemedian-2*lifestd, lifemedian-3*lifestd)
aboveperc = [None]*3
belowperc = [None]*3

for ii, (upper, lower) in enumerate(zip(plus_std, minus_std)):
    #fraction of samples between the median and the upper bound
    data_above = [jj for jj in lifetimes if lifemedian < jj < upper]
    aboveperc[ii] = len(data_above)/numsamp

    #fraction of samples between the lower bound and the median
    data_below = [kk for kk in lifetimes if lower < kk <= lifemedian]
    belowperc[ii] = len(data_below)/numsamp

    print('% of data within', ii+1, 'standard deviations of the median:', aboveperc[ii]+belowperc[ii])


% of data within 1 standard deviations of the median: 0.6880733944954129
% of data within 2 standard deviations of the median: 0.9541284403669724
% of data within 3 standard deviations of the median: 0.9938837920489296

Does the lifetimes data fulfill the third criterion of a Gaussian distribution?
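For comparison, the same check can be written without explicit loops over samples by measuring each sample's distance from the mean. This is a sketch on synthetic Gaussian data (a hypothetical stand-in, not the lifetimes data), so the results should land near the theoretical 68%, 95%, and 99.7%.

```python
import numpy as np

np.random.seed(3)
values = np.random.randn(10000) * 3.0 + 12.0   # synthetic stand-in, NOT the lifetimes data

mean, std = values.mean(), values.std()

# np.abs(values - mean) < k*std is a boolean array; its mean is the fraction of True entries.
fractions = [np.mean(np.abs(values - mean) < k * std) for k in (1, 2, 3)]

for k, frac in zip((1, 2, 3), fractions):
    print('fraction within', k, 'standard deviations of the mean:', frac)
```

Real data rarely matches the theoretical fractions exactly; a difference of a percentage point or so is typical for a few hundred samples.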